Investigations into the Instructional Process

II Objectivity of Coding in a Modified Flanders 
Interaction Analysis 



1. Introduction

This report deals with the problem of coding reliability
associated with observation studies. The repeatability of
videotaped situations adds new features to such analysis.

The intention is: 

1) to examine coding reliability by applying the customary
profile method (Flanders 1965, 23-30) to two coding occasions
separated by a lengthy time interval, with the object of
determining both within-occasion reliability (agreement) and
between-occasion reliability (constancy). The coefficients
obtained will be considered according to

a) school subjects,

b) coder pairs and

c) coding occasions;

2) to develop a method for measuring the reliability of any
one individual category and to consider the coefficients
obtained according to

a) school subjects, 

b) coder pairs, 

c) coding occasions and 

d) the order of coding. 
The present study is a sequel to a previous report published
in this series (Koskenniemi & Komulainen 1969). The same
material was dealt with in both studies. The videotaping took place
in the laboratory class of the University of Helsinki Institute
of Education. However, the reliability analysis related only
to a total of 10 lessons in four different subjects.

Table 1. Material of the study

Subject           Date of        1st coding   2nd coding   T1/T2
                  videotaping    T1           T2           comparison

 1. Civics        Oct. 30        X            X            X
 2. Civics        Nov. 27        X
 3. Arithmetic    Nov. 14        X            X            X
 4. Arithmetic    Nov. 20        X
 5. Religion      Oct. 25        X            X            X
 6. Religion      Oct. 28        X
 7. Religion      Nov. 1         X
 8. Religion      Nov. 8         X
 9. Finnish       Nov. 14        X            X            X
10. Finnish       Nov. 22        X

The videotaping was carried out during the autumn term 1967.
The interval between codings T1 and T2 was about three months.
The observation instrument used in the study was a 13-category
Flanders modification devised by the writer (Appendix 1).
2. On the Reliability Concept in Observation Studies

As in other measurements, in observation studies it is
imperative to ensure that the measuring instrument is resistant
to chance influences. In such studies the measuring instrument
cannot be considered to consist of the system of categories
alone; it comprises both the category system and its user, and
it is the reliability of this whole that is at issue. Thus, in
observation studies the reliability concept has a content partly
different from the one it has in other kinds of measurement
(Stukat 1966, 120).

Conversion of the content of the instructional process into
a form capable of quantitative treatment is called coding. Three
steps can be distinguished in the coding of an interactional
process (Guetzkow 1950, 47): (1) unitizing, which means the
division of the sequence of events into elements in accordance
with a rule agreed on in advance; (2) categorizing, which
means the placement of each unit into a classification system
designed in advance; and (3) attributing, which means the
identification of the originator of a behaviour unit and the target
of speech or any other sort of behaviour.

The present study is exclusively concerned with the reli- 
ability of categorizing, since the other two steps of coding 
can be considered to take place completely reliably. In the 
method used here, the unit was not a so-called natural unit 
but a time unit. Unitizing is not carried out by the coder but 
by a seconds counter. As a large-size seconds counter provides 
a frame of reference common to all coders, all the observers 
will perform the unitizing in the same way. Attributing, again, 
is already contained in categorizing, since the category em- 
ployed also indicates which one of the two possible originators 
- the teacher or the pupil - is in question. Thus it is
justifiable to maintain that examination of the reliability of the
method employed has to do with categorizing alone.
Observation reliability is most frequently defined as the
degree of consistency between the results of categorizing
performed by two observers simultaneously but independently. In
connection with judgments, this reliability concept is usually
referred to as "multi-judge" reliability (Kogan & Hunt 1960).
Here, however, the term inter-coder agreement, which is the
standard expression in content analysis, will be used to
emphasize the objective and mechanical nature of observation, in
contradistinction to the subjective element inherent in
judgments. Inter-coder agreement is the similarity between the
codings performed by two independent observers at the point
of time T1.

The requirement has been advanced, mainly in content
analytical studies (Berelson 1954), that the coder must be able
to employ the category system in his codings in the same way
at different times. Examination of this point has previously
been possible only with literary material, and therefore it
has not been investigated in the context of observation
(Borgatta & Bales 1953, 566-569). Brown & Webb write:
"Within-observer reliability would seem a far more useful concept than
between-observer reliability for establishing reliability
estimates for systematic observation" (1968, 37). Re-categorizing
from a videotape and comparing the various codings done by the
same person yields a reliability indicator called within-coder
constancy. Agreement between codings of the same situation
performed by different coders at different points of time can
also be examined. This reliability concept is termed
between-coder constancy.

Certain investigators who have used the observation
technique have spoken of reliability in a very broad sense. By the
reliability of observation they have meant the correspondence
between scores given by different observers in observation
situations at different points of time. That observations are made
at different points of time means that they relate either to
different lessons of the same teacher or to lessons in the same
subject by different teachers (Medley & Mitzel 1963, 253-254
and 309-312). The definition involved here is based on the
presumably high constancy of the trait to be measured. The high
reliability postulated in this definition presupposes that
observation pertains mainly to those permanent features of
classroom work that are due to the teacher. Changes in
classroom work from one situation to another are regarded as
perplexing, and attempts are made to eliminate their influence
by carrying out several observations of the same class in
different situations and by making use of the average of these.
Medley & Mitzel's definition ascribes to the error variance
that particular feature which is of central interest in the
present study: the fact that systematic differences occur in
the activity of one and the same class in different situations.
The reliability problem here concerned is not related to the
permanence of various features but to the dependability of the
measurement of the features.

The time interval between the codings was about three months.
On both occasions, four observers coded the same lesson
simultaneously but independently of each other. The following
simplified schematic representation of the two-observer case
indicates how the various agreement indices are formed.



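[Diagram: the codings of two observers at T1 and T2. Inter-coder
agreement (T1) and inter-coder agreement (T2) connect the two
observers within each occasion; within-coder constancy connects
each observer's own T1 and T2 codings.]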
3. The Overall Reliability as Estimated from
the Marginal Distributions



The estimation of reliability from the class frequencies
obtained rests on the underlying assumption that, if two
observers have assigned the same number of units to a given
category, they have assigned to it the same units and, thus,
show inter-coder agreement. The tenability of this assumption
will be discussed in chapter 5. The similarity between the
frequency profiles obtained by two observers of the same lesson
- i.e., profiles reflecting how the categorization system was
employed by these observers - indicates the degree of agreement
between them. Bales used chi-square as an index (Bales 1950, 110).
It has later been realized that serious shortcomings attach to
the use of both chi-square and the contingency coefficient
(Flanders 1965, 30-31; Cohen 1960, 38-39), and these have been
replaced by Scott's "pi" coefficient. Use is made thereby of
a profile converted to percentages (to reduce the apparent
disagreement arising from differences in the tempo of scoring),
and an attempt is also made to estimate the amount of agreement
due to chance (Scott 1955, 321-325). The coefficient is obtained
from the formula

(1)   $\pi = \dfrac{P_o - P_e}{1.00 - P_e}$

where: $P_o$ = observed percentage agreement

       $P_e$ = percentage agreement to be expected on the basis
               of chance, as obtained from (2)

(2)   $P_e = \sum_{i=1}^{k} P_i^2$

where: $P_i$ = the proportion of the entire sample that falls in
               the i:th category.
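
Formulas (1) and (2) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the report does not spell out how Po is obtained from the two percentage profiles, so the sketch assumes the usual worksheet rule (one minus the sum of the absolute differences between the two profiles) and takes Pi as the mean of the two coders' proportions. The function name and the tallies are hypothetical, not data from the study.

```python
def scott_pi_from_profiles(counts_a, counts_b):
    """Scott's pi estimated from two coders' category tallies (profile method)."""
    total_a, total_b = sum(counts_a), sum(counts_b)
    # Convert tallies to proportions so that differences in scoring tempo
    # do not count as disagreement.
    prof_a = [c / total_a for c in counts_a]
    prof_b = [c / total_b for c in counts_b]
    # ASSUMPTION: Po = 1 - sum of absolute differences between the profiles.
    p_o = 1.0 - sum(abs(a - b) for a, b in zip(prof_a, prof_b))
    # Formula (2): Pe = sum of squared category proportions, with each Pi
    # taken as the mean of the two coders' proportions (assumption).
    p_e = sum(((a + b) / 2.0) ** 2 for a, b in zip(prof_a, prof_b))
    # Formula (1).
    return (p_o - p_e) / (1.0 - p_e)


# Hypothetical tallies for a four-category system.
coder_a = [40, 30, 20, 10]
coder_b = [35, 30, 25, 10]
pi = scott_pi_from_profiles(coder_a, coder_b)
print(round(pi, 3))
```

Because only the marginal tallies enter the computation, two coders with identical profiles obtain pi = 1 whether or not they placed the same units in the same categories.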




Scott's pi takes into account the fact that the agreement
to be expected on the basis of chance does not equal the
theoretical expectation (1/k, where k = the number of categories)
but varies according to the relative frequency of occurrence of
each category (Pi) in the sample to be analysed. Regarding the
interpretation of this coefficient, Scott states that it roughly
indicates the extent to which the coding reliability exceeds
chance. The range of variation of the pi coefficient has
properties similar to those of the coefficient of correlation
(Cohen 1960, 41-43).



4. Results Concerning Overall Reliability 

When a picture is formed of the reliability of observation,
it is imperative for us, in principle, to estimate the part
played by chance. More important than to test a null hypothesis
is, however, to examine the size of the coefficients under
varied conditions, since in practice the degree of agreement
invariably exceeds chance, whatever the directions given for
categorizing (Schutz 1952, 120). What is essential is to find out
how far the observed reliability meets the reliability
requirements the investigator has set for the problem under study. In
an intensive study the reliability must be comparatively high.

In the present study the mean of all the agreement
coefficients computed was .79 (Table 2).

The differences between the school subjects were not
statistically significant (t-test). The coders were able to arrive
at similar categorizations, regardless of the school subject
concerned (Table 2).

In the group of reliability coefficients indicative of
agreement between pairs of coders, statistically significant
differences were in evidence (Tables 2 and 3). The reliability
coefficients for the pairs including Observer 2 were systematically
lower than the others. This observer's conception
of the categorizing directions had differed systematically
from the other observers' conceptions. The coefficients
indicating agreement between Observers 1, 3 and 4 represented
a rather satisfactory level.



Table 2. The Pi Coefficients Computed from the Material

     Type of coefficient          Mean     S      N

 1.  Total sample                 .79     .08     84
 2.  Inter-coder agreement T1     .79     .09     60
 3.  Inter-coder agreement T2     .80     .06     24
 4.  Between-coder constancy      .71     .12     48
 5.  Within-coder constancy       .74     .13     16
 6.  Coder pair 1, 2              .74     .10     14
 7.  Coder pair 1, 3              .84     .03     14
 8.  Coder pair 1, 4              .83     .03     14
 9.  Coder pair 2, 3              .73     .08     14
10.  Coder pair 2, 4              .76     .09     14
11.  Coder pair 3, 4              .83     .04     14
12.  Religion                     .81     .06     24
13.  Civics                       .77     .09     12
14.  Arithmetic                   .76     .12     12
15.  Finnish                      .80     .08     12




Table 3. Significance of the Differences between
Coder Pair Reliabilities, t test

Coder pair    1,2      1,3      1,4      2,3      2,4      3,4

1,2                   -3.09    -3.73    -0.42     0.47    -3.27
1,3                             0.30     4.71     3.11     0.61
1,4                                      4.61     3.00     0.35
2,3                                              -0.91    -4.20
2,4                                                        2.70

[The legend marking values at the risk level p .01 is not
recoverable.]

In interpreting the table, the effect of overlapping
classifications on the risk-level limits should be taken into
consideration (see Hays 1963, 375-376; 471-472; 483-485).

The coefficients for the first coding did not differ
significantly from those for the second coding. Inter-coder
agreement was .79 during the autumn term (T1) and .80 during
the spring term (T2). When the two coding occasions were
compared so as to determine the between-coder constancy, highly
significant differences were found (T1/comparison, t = 3.72,
df = 121, p < .001; and T2/comparison, t = 3.05, df = 03,
p < .001). Inter-coder agreement was high on both occasions,
whereas between-coder constancy was rather poor. This state
of affairs can be illustrated by the following graphic
representation.




There is reason to assume that the codings performed dur- 
ing the autumn term, when the time lapse since coder training 
proper was comparatively short, were better estimates of 
"correct" codings than were the spring term codings, which had 
changed for all the coders in the same direction. 

Within-coder constancy was slightly higher in comparison 
with between-coder constancy, though not to a statistically 
significant extent. The result supports the interpretation 
that the coder group as a whole shifted in the same direction 
in the use of the categorizing criteria between the points of
time T1 and T2.



5. An Appraisal of the Overall Reliability Results

The results obtained concerning overall reliability suggest
that the alterations made in Flanders's original categories did
not worsen agreement, at least not decisively, even though the
number of erroneous categorizing possibilities increased and
the degree of agreement to be expected on the basis of chance
diminished. The coefficients obtained here can be compared with
reliabilities obtained in other studies. Hough, Lohman &
Ober state that if Scott's pi equals .60, this can be regarded
as a minimum proficiency (1969). According to Flanders,
"a Scott coefficient of .85 or higher is a reasonable level of
performance" (1967, 166). The average coefficient obtained in
the present study was slightly lower. As a general rule,
however, the reliability coefficients obtained in previous studies
have not quite reached the limit suggested by Flanders (Hough &
Ober 1967, 334).

The change that was found to take place in the principles
of categorization, judging by both the between-coder and
within-coder constancy comparisons, has relevance to coder
training as well as to the treatment of the material. Agreement
controls carried out at given intervals are not enough to avoid
systematic errors in coding; constancy control through time
must also be resorted to. It is generally inadvisable to code
any material in chronological order, since trends due to the
observer's behaviour may then show up in the measurements.
The emergence of such trends can effectively be
prevented by randomizing the order of coding.

The estimation of overall reliability rests on the
assumption that the units assigned by various coders to a category are
the same if they are equal in number. Several studies speak,
however, against this assumption. Observers may commit mistakes
offsetting one another, with the result that the marginal
distributions will remain identical. Scott's pi will then yield
excessively high values. In studies where reliability has been
examined unit by unit, agreement coefficients 10 to 20 percent
lower in value have been obtained on an average (Waxler &
Mishler 1966). It sounds paradoxical that, where two coders do
the coding completely at random, the pi coefficient will
approach unity, yet this is the case since the marginal
distributions will eventually become similar in shape. Also, the shift
phenomenon observed in the use of the categorizing principles
merits closer examination. Thus, the overall reliability method
must be supplemented by a method through which the reliability
of any one individual category can be determined.
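
The paradox of random coding can be illustrated with a small simulation. The sketch below is not part of the original analysis: it assumes two hypothetical coders who pick one of 13 categories at random for each of 5,000 units, and it assumes that the observed agreement of the profile method is one minus the sum of the absolute differences between the two percentage profiles. All names are illustrative.

```python
import random


def profile_pi(codes_a, codes_b, categories):
    """Scott's pi computed from the two marginal (percentage) profiles only."""
    prof_a = [codes_a.count(c) / len(codes_a) for c in categories]
    prof_b = [codes_b.count(c) / len(codes_b) for c in categories]
    # ASSUMPTION: Po = 1 - sum of absolute differences between the profiles.
    p_o = 1.0 - sum(abs(a - b) for a, b in zip(prof_a, prof_b))
    p_e = sum(((a + b) / 2.0) ** 2 for a, b in zip(prof_a, prof_b))
    return (p_o - p_e) / (1.0 - p_e)


def unit_agreement(codes_a, codes_b):
    """Share of units that both coders placed in the same category."""
    matches = sum(a == b for a, b in zip(codes_a, codes_b))
    return matches / len(codes_a)


rng = random.Random(0)          # fixed seed, so the run is repeatable
categories = list(range(13))    # stand-ins for the 13 categories
codes_a = [rng.choice(categories) for _ in range(5000)]
codes_b = [rng.choice(categories) for _ in range(5000)]

pi_from_profiles = profile_pi(codes_a, codes_b, categories)
agreement_by_unit = unit_agreement(codes_a, codes_b)
print(pi_from_profiles, agreement_by_unit)
```

With purely random coding the two profiles are nearly identical, so the profile-based pi comes out close to unity, whereas the unit-by-unit agreement stays near the chance level of 1/13.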



6. The Single-unit Single-category Coding Situation 

When the coder assigns any one unit to a dichotomous category, 
this can be described, from the viewpoint of probability theory, 
in the following way (Schutz 1952, 121-122; Guetzkow 1950, 51): 




U = unit to be coded
K = coding according to criterion
R = random coding
C = "correct" coding
W = "incorrect" coding

Thus, categorizing may turn out correctly either because the
coder uses the criterion or by chance. The probability of the
coder's employing the criterion can be computed. What we need to
find is the probability with which the unit U is coded correctly
by employing the criterion K. From Bayes's rule we have



(3)   $p_{k,c} = \dfrac{2x}{x + 1}$

Yet we do not know the value of x. What we can observe is only
agreement (A), which also includes the correct coding due to
chance factors ($p_{r,c}$). Here, A is the probability for the
coding to be correct ($p_c$). 1) From the rule of elimination,

(4)   $p_c = \tfrac{1}{2}(x + 1)$

and thus,

(5)   $x = 2A - 1$

Substituting x into (3),

(6)   $p_{k,c} = \dfrac{2A - 1}{A}$

whence

      $A = \dfrac{1}{2 - p_{k,c}}$




Now a value can be computed for empirical agreement (A)
such that the probability with which the coder employs the
criterion in arriving at a correct result will be, say,
$p_{k,c}$ = .90. The value of A will then be .91. The matter is
not invariably so simple in practice. Above we assumed that
the correct coding was known. Agreement (A) between two coders
may, however, also be due to the fact that both categorize the
same unit in the same way but incorrectly. Who does, in the
last resort, decide the correctness of coding? In other words,
who can be said to proceed in a perfectly reliable manner in
employing the criterion? The absence of an ultimate criterion,
in combination with the fact that category systems are rarely
dichotomous, complicates the reliability analysis concerning
the individual category.
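
The relation between observed agreement A and the probability that a correct coding rests on the criterion, (2A - 1)/A, is easy to check numerically. The following is a minimal sketch for the dichotomous case discussed above; the function names are illustrative and do not come from the report.

```python
def criterion_probability(agreement):
    """Probability that a correct coding rests on the criterion: (2A - 1) / A."""
    return (2.0 * agreement - 1.0) / agreement


def required_agreement(p_criterion):
    """Agreement A needed for a given probability: A = 1 / (2 - p)."""
    return 1.0 / (2.0 - p_criterion)


# The worked value from the text: p = .90 requires A of about .91.
needed = required_agreement(0.90)
print(round(needed, 2))

# At A = .50, the chance level for a dichotomy, the probability falls to zero.
print(criterion_probability(0.50))
```

The second function is simply the first solved for A, so applying one after the other returns the starting value.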



1) Here, $p_c$ is the theoretical and A the observed value.
On the basis of the above argument and one of Osgood's
content analysis models (Osgood 1959, 33-88), the present writer
developed a method for the estimation of reliability values for
each of the thirteen categories used. The following requirements
were imposed on the method:

1) Each category was to be dichotomized, so that use could
be made of the above probability model.

2) The unit was to be such that each of the two coders would
actually base his categorizing on the same unit.

3) All the categories were to be considered simultaneously,
to ensure that all the reliability results would rest on
activities comparable to original coding and that they would not be
unrealistically high.

The following two assumptions were made:

1) Instances of reliable coding are those where two
independently acting coders show that they have simultaneously
perceived the occurrence of a behaviour belonging to a given category.
Perception of the absence of a behaviour is not regarded as
reliable categorizing.

2) The true frequency of events in a category is the mean
of the frequencies observed by the coders. The correct frequency
is unknown, but the mean is supposed to be the best available
estimate of it (Wright 1967, 96).

The method will be described below as it is applied to any
one dichotomized category. In this study, all the categories
were on the same coding sheet (Appendix 2), and the
categorizing principle was exactly the same for each of them. A and B
are two coders whose agreement about one category is in question.
At a point indicated in the videotape, the seconds counter is
started and coding begins. During the first ten seconds the
coders have to observe the train of events on the videotape. During
the following ten seconds they do not attend to the videotape
but, instead, indicate on the coding sheet whether activities
falling in the category concerned were in evidence during the
preceding ten-second period. Ten-second periods of observation
and note-taking alternate for some 30 minutes. The situation
concerning any one category can be represented as in Figure 8
(see Kerlinger 1964, 67-80).



Figure 8. A Schematic Representation of Reliability
for One Category

[Diagram: subsets A and B of the basic set E, with their common
part C]

E            = basic set, or the set of all observation periods
A            = the subset of E consisting of those units that,
               according to coder A, included behaviours
               satisfying the categorizing criterion
B            = the subset of E consisting of those units that,
               according to coder B, included behaviours
               satisfying the categorizing criterion
C = A ∩ B    = the subset of E coded in a reliable manner,
               according to the assumption
N            = the number of units (i.e., observation periods)
               in the basic set
N_A          = the number of units in subset A
N_B          = the number of units in subset B
N_C          = the number of units in subset C
½(N_A + N_B) = on the assumption made, the best estimate of the
               frequency in the basic set of the behaviours
               belonging to the category
Agreement for the category is obtained from

(7)   $A = \dfrac{N_C}{\tfrac{1}{2}(N_A + N_B)}$

indicating what proportion of the estimated number of units
belonging to the category was coded in a reliable manner. The
agreement due to chance can be computed from the formula

(8)   [formula illegible in the source]

Preliminary experiments showed that the 10-second observation
period was suitable. When use was made of all 13 categories,
each unit period generally included 1 to 4 behaviours
representing the various categories.
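
The category agreement of formula (7), the number of jointly marked periods divided by the mean of the two coders' frequencies, can be sketched as follows. The chance formula (8) is not legible in the source, so `chance_agreement` below is only an assumption: it supposes that the two coders mark periods independently of each other, which gives an expected overlap of N_A * N_B / N. The period numbers are hypothetical.

```python
def category_agreement(units_a, units_b):
    """Formula (7): jointly marked periods over the mean of the two frequencies."""
    set_a, set_b = set(units_a), set(units_b)
    n_c = len(set_a & set_b)                      # periods marked by both coders
    estimate = (len(set_a) + len(set_b)) / 2.0    # best estimate of true frequency
    return n_c / estimate if estimate else float("nan")


def chance_agreement(n_a, n_b, n_units):
    """ASSUMPTION (not the report's formula 8): independent marking."""
    expected_overlap = n_a * n_b / n_units
    return expected_overlap / ((n_a + n_b) / 2.0)


# Hypothetical example: 90 observation periods, one dichotomized category.
marked_by_a = range(1, 31)    # coder A marked periods 1..30
marked_by_b = range(6, 36)    # coder B marked periods 6..35
agreement = category_agreement(marked_by_a, marked_by_b)
expected = chance_agreement(30, 30, 90)
print(round(agreement, 3), round(expected, 3))
```

Periods in which neither coder marked the category do not enter formula (7), in line with the assumption that perceiving the absence of a behaviour is not counted as reliable categorizing.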



7. Results concerning the Reliabilities of the Various 
Individual Categories 



A total of 2,072 agreement coefficients (A) were computed,
each of which rested on some 90 observation periods. The
corresponding expectations were also computed, but the writer
feels that it is unnecessary to report them here. In each case
the agreement coefficients exceeded their expected values very
definitely and to a statistically significant degree (Table 4).

The coefficients were examined by means of one-way analysis
of variance. This method was chosen because the groups to be
compared were usually more than two in number. A total of 112
analyses of variance were computed, 56 of which (those for the
A coefficients) are presented in this report. Such a large
number of statistical tests is likely to include cases where
statistical significance is due to chance. Moreover, the
consecutive analyses are not mutually independent, and thus the
probability of the rejection error will exceed the chosen risk
level. F values for which p > .01 were not regarded as significant.
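
The size of the problem can be gauged with a short calculation. The sketch below assumes, for illustration only, that the tests are independent, which the text explicitly says they are not; it therefore gives an order of magnitude and not the actual risk of the analyses reported. The function name is illustrative.

```python
def familywise_error(alpha, n_tests):
    """Probability of at least one chance rejection among n independent tests."""
    return 1.0 - (1.0 - alpha) ** n_tests


# 56 reported analyses of variance, each at the .01 risk level.
risk_56 = familywise_error(0.01, 56)
# All 112 analyses that were computed.
risk_112 = familywise_error(0.01, 112)
print(round(risk_56, 2), round(risk_112, 2))
```

Under this simplifying assumption, at least one spurious significance at the .01 level would be expected in roughly four cases out of ten for the 56 reported analyses.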

There were considerable differences between the categories
in reliability. Categories 3, 4b and 9b seemed the poorest in
this respect (Table 4). Nevertheless, the values for all the
categories definitely exceeded the corresponding mathematical
expectations. No standard deviations are given in the following
tables, as these were, by and large, equal to those set out in
Table 4. As was to be expected, "pupil answers" (8) and "teacher
lectures" (5) were the two most clear-cut categories, judging
by both the high coefficients and the standard deviations.

Consideration of the coefficients by school subjects reveals
a number of significant differences (Table 5). Arithmetic
lessons had apparently been the most difficult to code. This was
perhaps due to the general nature of these lessons, which
contain a lot of parallel and intertwined interaction, associated
with blackboard work, etc. Individual guidance and blackboard
work presented particular interpretational difficulties in
coding. On the other hand, religion lessons seemed to be the
easiest to code, judging by this material.

Analysis of the differences between coder pairs supported
the finding made in considering the overall reliability that
Observer 2 had done the coding differently from the other three
(Table 6). The coefficients for the pairs including Observer 2
were all lower in comparison with the rest of the coefficients.
The other coders had carried out the codings more uniformly and
no clear differences were perceptible between them.

Regarding the coding occasions the following was observed.
Inter-coder agreement diminished with time (Table 7). This was
so for almost all categories. The only exception was provided
by "pupil answers" (8), in the case of which the categorizing
had remained more or less unchanged.




Table 4. The Agreement Coefficients (A) as Computed from the
Data and their Corresponding Expectations

As was to be expected, between-coder constancy was slightly
lower than within-coder constancy. Only in categories 4b, 8
and Z were the differences not significant. No consistent shift
comparable to that discovered in the context of overall
reliability was here in evidence. The changes that had taken place
between T1 and T2 varied in direction, depending on the coder.
The changes differed from one category to another, and no changes
had occurred for categories 8 and 4a. These two categories in fact
represent forms of behaviour that comparatively infrequently
necessitate an interpretation of the total situation on the part
of the coder.

The codings were performed during two consecutive sessions
of an hour's duration on each of the coding days. It was thus
possible to compute an analysis of variance concerning the order
of coding. No fatigue effect was in evidence (Table 8). As a
matter of fact, the second codings were even better.

The overall reliability method and the determination of
reliability by individual categories provided a similar picture
of the observers' coding proficiency. The computed probability
(p) with which a unit is correctly categorized may be used as
a measure of the observer's accuracy. Provided that the
inter-coder agreement coefficients for two or more coders are known,
the coders' accuracy can be estimated (see Guetzkow 1950, 54
and Bernstein 1969, 49-52).

The accuracy coefficients computed from the two types of
reliability estimates were largely similar (Table 9).




Table 9. Coder Accuracy (1 = as computed from overall
reliabilities; 2 = as computed from the mean yielded by
the method developed by the writer)

            Method of computation of agreement
                1          2

Coder 1        .92        .87
Coder 2        .80        .71
Coder 3        .91        .86
Coder 4        .91        .84
8. Consideration of the Results about Reliability

Waxler & Mishler state that "no simple prescription can be
offered either about how to compute an index of reliability
or what level to set acceptable" (1966). Unfortunately,
investigators have generally been content with ascertaining the
presence of some sort of reliability, without troubling themselves
with the question of what kinds of treatment and operations are
justified by it. It is obvious that in an intensive case study
a comparatively high level of reliability is a necessity.
Another point should also be taken into consideration in
interactional research. What I have in mind can be demonstrated by
a brief example. Let us assume that the coder of a given material
has obtained the following sequence of numbers, the encircled
number representing erroneous coding.

[Diagram: the coded sequence of categories, with its connections
and second-order connections]

In the estimation of overall reliability, erroneous coding
causes a change in the profile and slightly reduces agreement
($A_0$). When the sequence of numbers is tabulated by connections
into the interaction matrix, two of the connections are found
to fall in incorrect cells ($A_1$). In a second-order Markov chain,
three sequences of events, i.e., 3-5-5, 5-5-4 and 5-4-8, will
be misplaced ($A_2$). It can be shown that agreement declines fast
when we proceed to higher-order connections. How fast the
decline is depends on the starting level.
Figure 1. The Decline in Agreement in Shifting from the
Profile to Connections

It will be seen that the starting level (i.e., inter-coder
agreement) should exceed .80, since if it is lower, the
second-order Markov chains will make little sense and the interaction
matrix becomes undependable.
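
How a single coding error spreads to connections of higher order can be sketched by counting the windows of consecutive codes that contain the error. The sequences below are hypothetical: the observed codes contain the run 3-5-5-4-8 mentioned in the text, and the "true" code 6 substituted for the erroneous 5 is an assumption made only for the illustration.

```python
def misplaced(true_codes, observed_codes, order):
    """Count windows of order + 1 consecutive codes that differ between the lists.

    order 0 = single codes (profile), order 1 = connections (pairs),
    order 2 = second-order connections (triples).
    """
    width = order + 1
    return sum(
        true_codes[i:i + width] != observed_codes[i:i + width]
        for i in range(len(true_codes) - width + 1)
    )


observed = [4, 3, 5, 5, 4, 8, 5]   # the second 5 is the erroneous coding
true = [4, 3, 5, 6, 4, 8, 5]       # hypothetical correct coding

errors_by_order = [misplaced(true, observed, order) for order in range(3)]
print(errors_by_order)
```

One wrong code thus misplaces one entry of the profile, two connections and three second-order connections. If errors were independent from unit to unit (an assumption not made in the report), a unit agreement of .80 would correspond to roughly .80 ** 2 = .64 for connections and .80 ** 3 = .51 for second-order connections.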

One qualification is necessary here, however: there are
various sorts of error, although no attention was paid to the
meaningfulness of errors in this study.

However, perfectionism with regard to reliability is not
likely to be an appropriate goal. Brown & Webb hit the mark in
stating: "A team of observers can be trained to the point of near
perfect agreement, but this does not eliminate the possibility
that instead of making numerous subjective judgments of a
differing and conflicting nature (as they did prior to learning), they
now make only one - the same one" (1968, 35). The rules guiding
the coders must not be a means to attain a high reliability;
instead, they should be a method intended to facilitate the
measurement of a theoretically important concept.
Appendix 1.

The Employed Classification System

Teacher talk     1.  Accepts, praises or encourages
                 2.  Corrective feedback
                 3.  Uses pupil ideas
                 4a. Asks narrow questions
                 4b. Asks broad questions
                 5.  Expresses information or own opinions
                 6.  Gives directions
                 7.  Criticizes pupil behaviour

Pupil talk       8.  Answers to a question
                 9a. Relevant spontaneous talk and suggestions
                 9b. Irrelevant spontaneous talk

Others          10.  Silent work, individual work or guidance
                 Z.  Tumult, confused situation






Appendix 2.

The Coding Sheet Employed in Estimating the Reliabilities
of the Individual Categories

Unit No.    1   2   3   4a  4b  5   6   7   8   9a  9b  10  Z

1
2
3
4
etc.

Coder:
Date:                       Time:
Lesson:
Date and time of rec.:
Starting point: